Target specific data filter to speed processing

ABSTRACT

A method is presented which reduces data flow and thereby increases processing capacity while preserving a high level of accuracy in a distributed speech processing environment for speaker detection. The method and system of the present invention include filtering out data based on a target speaker specific subset of labels using data filters. The method preserves accuracy and passes only a fraction of the data by optimizing target specific performance measures. Therefore, a high level of speaker recognition accuracy is maintained while utilizing existing processing capabilities.

FIELD OF THE INVENTION

The invention relates to a contemporary speech analysis method and system, and more particularly, to a speech analysis method and system which filters out data using a data filter based on a target speaker specific subset of labels.

BACKGROUND OF THE INVENTION

Contemporary speech analysis systems operate over a range of operating points, each with a characteristic resource usage, accuracy, and throughput. It would therefore be desirable for a speech analysis method to address the problem of increasing throughput while minimizing the impact on accuracy while maintaining or reducing the resource usage for a speech based detection task, in particular, speaker recognition. It would further be desirable for a speech analysis method to extract time-critical information from large volumes of unstructured data while filtering out unimportant data that might otherwise overwhelm the available resources.

SUMMARY OF THE INVENTION

The invention relates to a method and system for reducing the amount of speech data for detecting desired speakers which comprises providing a plurality of speech data from a plurality of speakers. The speech data is reduced using a speaker dependent filter which employs a label priority ranking of the labeled speech data for each of the speakers. The labeled speech data is prioritized according to a performance criterion for each of the plurality of speakers. Then, the labeled speech data is analyzed according to the priority rank of the label of the speech data for a specific speaker, wherein specified lower priority labeled speech data is dropped or filtered out such that less speech data is analyzed.

In a related aspect of the invention, the speakers are divided into groups of speakers and the filters are generated and used for each group of speakers. The data may also be pre-labeled.

In a related aspect of the invention, the labeled groups of speakers are priority ranked.

In another aspect of the invention, a speaker recognition system using a computer comprises a plurality of speech data from a plurality of speakers stored on a computer. A labeler element analyzes the speech data and labels the data. A speech data filter includes a filter element for arranging the labels in a priority rank and is designed according to a performance criterion for each of the plurality of speakers. A recognition element analyzes the labeled data in accordance with the priority rank as determined by the speech data filter.

In a related aspect of the invention, the labels are phonetic.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an exemplary data reduction filter design;

FIG. 2 is a block diagram of the data reduction filter design of FIG. 1 wherein the target and non-target data is previously labeled; and

FIG. 3 is a block diagram depicting the use of the data reduction filter shown in FIGS. 1 and 2.

DETAILED DESCRIPTION OF THE INVENTION

The present invention provides for data filtering in the context of a distributed speech processing architecture designed to detect or identify speakers where the design of the filter depends on the ability to recognize the specific desired speaker, or group of speakers. According to the present invention, a target specific data filter method and system is provided which allows a large percentage of data to be ignored while preserving substantial accuracy. Filtering can refer to selecting data as well as excluding data.

The target specific data filter method and system of the present invention applies data filtering to provide the advantage of increasing throughput while minimizing the impact on accuracy. The filters are optimized for the detection of a specific target. This reduction of data by filtering allows, for example, more audio data to be concurrently processed. An advantage of the present invention is that, for example, more data can be processed in a fixed amount of time without an increase in resources, resulting in an increase in processing capacity.

The architecture that is presented consists of a feature extraction component, a feature labeling and/or filtering component which can reduce the data, and a speaker detection component. The recognition task itself encompasses two sub-tasks, that of identification and verification (detection). Thus, target models are associated with an optimal set of labels with respect to recognition performance or another performance measure. In cases where the identity of the speaker sought is known, the data in the system can be filtered to pass only the optimal labels for that speaker. Alternatively, if there is a set of speakers of interest, for example, the enrolled population, then an aggregate best list of labels can be chosen.

The nature of the labels, e.g. the characteristics of the alphabet, will determine the granularity with which the data can be filtered to achieve various operating points on a performance vs. complexity curve. Phonetic labels may be used to show that testing based on the filtered data preserves accuracy.

The present invention addresses the task of filtering out data to increase the throughput (processing capacity) while minimizing the impact on accuracy and maintaining or reducing the resource usage for a speech based detection task, in particular, speaker recognition. Filtering in this context means selectively dropping data. The result of this approach improves the capability to analyze higher volumes of data with a limited set of resources which can remain constant.

According to the present invention, when a large amount of data is held it can be either labeled or unlabeled. Referring to FIG. 1, if the data 14, 18 is unlabeled, then the data 14, 18 is passed through a labeler. The labeled data 14 a, 18 a has to be analyzed; however, a problem arises when a system does not have the resources to analyze that kind or quantity of data. The target specific data reduction filter 40 is used to select the part of the data to analyze. Thus, only part of the data (the filter 40 having filtered out the rest) is analyzed by the recognition analysis component 26 c, as shown in FIG. 3.

According to the present invention, speaker recognition can be realized as computationally isolated algorithmic components arranged in an analysis pipeline, wherein a sequence of speech frames, each processed individually, constitutes the data stream flowing through the pipeline. These components can be used for multiple tasks. For example, when the input is compressed audio data, the various components might include audio waveform decompression, speech feature extraction, intermediate feature processing, and speaker detection. Any component can act as a data filter for downstream components, eliminating their computation time for the filtered frame. In order to obtain the desired results, quality filters are applied. A methodology is developed to determine effective, high quality filters based on the special properties of the recognition task. The phonetic labels are associated with each data frame, but the technique could apply generally.

According to the present invention, data is filtered (i.e., passed or dropped) in a speaker or target specific manner to increase processing capacity in a distributed framework while trying to maintain accuracy. There are two different phases of the method/system according to the present invention: first, creating the filter, i.e., the design of the target specific data filter (as shown in FIGS. 1 and 2), and then using the designed filter to filter the labeled target data (as shown in FIG. 3). To design the filter, the data is analyzed as a sequence of feature vectors arriving one after the other in time. Each feature vector has a label associated with it, which, for example, may be a phonetic label; thus the data is labeled. The analysis uses the labeled data from a particular target, and the recognition performance is evaluated based on each of the labels, as well as on combined subsets of these labels.

Referring to FIG. 3, the labeled input data 64 is filtered corresponding to the set of labels that perform the best and inputted into the recognition analysis program 26 c. Thus, the target specific data reduction filter 40 is designed in a first phase, as shown in FIGS. 1 and 2, and implemented and used in a second phase, shown in FIG. 3. The data used to design the filter 40, in FIGS. 1 and 2, may be different than the data which is actually filtered, in FIG. 3, at least in quantity.

Referring to FIG. 1, the target specific data filter method and system 10 includes data to be analyzed comprising target data 14 and non-target data 18. Both the target data 14 and non-target data 18 are unlabeled. The system 10 includes target data 14 and non-target data 18 both entering labelers 22 a and 22 b, respectively. The labelers apply, for example, phonetic labels, to the target data and non-target data. Other labeling systems may also be used, for example, a set of labels corresponding to a set of clusters (feature vectors that naturally group together) derived in some manner from development data. The speech data may include a group of speakers or one speaker (or any subset of speakers). When the speech data is targeted, the speech data from the specified speaker (or speakers) is labeled 22 a. Other speech, not from the targeted individuals, is labeled in 22 b and provided as non-target data. The speech data is labeled for each data vector 14, 18 by a labeler 22 a, 22 b, resulting in labeled data 14 a and 18 a as an output from the labelers 22 a and 22 b, respectively.

Non-target data 18 is data from the speaker or speakers that are not being targeted. The filter 40 is created to seek out one specific entity (speaker), and the target data is data from that entity/speaker. Both the target data 14 and non-target data 18 go through the labelers 22 a and 22 b, respectively, and are labeled with a set of phonetic units, or a set of labels, or any recognizable label. The data, as a sequence of feature vectors, has a label associated with it for a particular target. The recognition performance based on each of the labels, as well as on combined subsets of these labels, can be ascertained. The labelers 22 a and 22 b may use any means to label the data, for example, a computer program. The data could be labeled with any label that will provide the same function, such as label 1, label 2, label 3. In general, the data could be labeled based on any set of labels, provided that each feature vector of the data belongs to at least one label.

Once the data is labeled, a recognition analysis 26 a and 26 b is provided for each individual label and for each group of labels, as shown in FIG. 1. The recognition analysis 26 a and 26 b is the process of prioritizing the labels in order of accuracy, and/or may include other performance measures. A performance analysis is done for each label to show which label is most effective in retaining information for detecting the target from the data. For example, consider the set of phonetic labels. Looking in turn at data corresponding to each label, it may turn out that speaker A is best detected using the data labeled “iy”. On the other hand, speaker B might best be detected using data corresponding to the “ow” label, etc. Thus, each possible label is evaluated for performance and the best label can be chosen. Non-target data is used to test the robustness of the data reduction filter and can be incorporated in a performance measure.

The labels are prioritized for filtering the data using a performance criterion that uses, for example, the accuracy of detection and/or the amount of data reduced. An individual label for data can be evaluated or a group of labels can be evaluated as a group for filtering data. For example, a group of two labels might correspond to less data than a third other label, and performance with respect to that data may be better than that with respect to the data from the third label.

FIG. 2 shows the creation of a target specific data reduction filter 40 where the data 14 a, 18 a is already labeled by another source. All the labeled data 14 a and 18 a is inputted in the recognition analysis 26 a and 26 b, which generates the filter 40 based on the filter design 30. Again, the recognition analysis 26 a and 26 b is the process of prioritizing the labels in order of accuracy. For example, the best label may be a single label, and the second best label may be a combination of labels or group of labels, as explained above.

As discussed above, different labels may be better for different target data. For example, to detect the presence of a first speaker in a group of data the recognition analysis method may look at the “A” sound because for that speaker the best detection occurs when targeting the sound of “A”, and for a second speaker, the “E” sound may result in better detection.

Once the best label or group of labels is determined, the filter design can be characterized, for example, as a rank list of label sets according to their performance. A large amount of data can be reduced by filtering the data using the filter design 30 developed by the recognition analysis 26 a.

An example embodiment of using the filter 40 according to the present invention is shown in FIG. 3. The resulting filtered data is much smaller than the original large data group and can be analyzed by the recognition analysis 26 c while expending far fewer resources, for example, computer processing and storage capacity, and maintaining a high level of accuracy, as demonstrated in the experimental results.

The system and method of the present invention produce a ranked list of target specific labels (or label sets). The data reduction results from only keeping data corresponding to the top number (N) of labels (or sets), where N depends on the resource availability.
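The top-N selection described above can be illustrated with a minimal Python sketch. The names `top_n_labels` and `filter_frames`, the toy frames, and the example ranking are hypothetical and chosen only for illustration; the ranked list is assumed to have been produced already by the filter design phase.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# A labeled frame: (feature vector, label). Purely illustrative layout.
Frame = Tuple[Sequence[float], str]


def top_n_labels(ranked_labels: Sequence[str], n: int) -> Set[str]:
    """Return the labels passed by the filter: the N highest-ranked labels."""
    return set(ranked_labels[:n])


def filter_frames(frames: Iterable[Frame], passed: Set[str]) -> List[Frame]:
    """Keep only frames whose label is passed by the filter; drop the rest."""
    return [(x, label) for x, label in frames if label in passed]


# Hypothetical ranked list for one target speaker (best label first).
ranked = ["N", "IH", "T", "S"]
# Hypothetical labeled input frames with one-dimensional features.
frames = [([0.1], "N"), ([0.2], "AX"), ([0.3], "IH"), ([0.4], "T")]

# N would be chosen from the available resources; here N = 2.
kept = filter_frames(frames, top_n_labels(ranked, 2))
print([label for _, label in kept])  # ['N', 'IH']
```

Raising N passes more data to the recognition analysis and lowering it passes less, which is how the operating point could be matched to the available resources.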

Thus, data can be either unlabeled and receive a label after going through the labeler 22 a, 22 b (as shown in FIG. 1), or the data can be pre-labeled data (as shown in FIG. 2). Each speaker receives his own rank system of labels. The method provides using filters which include a ranking list for the performance of the labels. The labels are used to reduce a large quantity of data by either choosing data having a specified label, and/or ignoring data not having the specified label (or labels). Thus, the amount of data for analysis is reduced. As shown in FIG. 3, labeled input data 64 is filtered using the target data reduction filter 40 before the recognition analysis 26 c. The quantity of data is greatly reduced before the recognition analysis 26 c. The recognition analysis step 26 c is then able to take much less time and resources to produce speaker recognition results 50.

The components of the speaker recognition system 10 may utilize a computer and computer program. The target data 14 and non-target data 18 may be stored on a single or multiple computers. The labelers 22 a, 22 b, recognition analysis program 26 a, 26 b, filter design 30, and target specific data reduction filter 40 may all be stored on a computer or distributed on multiple computers and use computer programs to implement their respective functions.

Recognition Model

One embodiment according to the present invention uses a state of the art speaker recognition system, where modeling is based on transformation enhanced models in the GMM-UBM (Gaussian Mixture Model-Universal Background Model) framework. Data is represented as a sequence of vectors {x_(i)}, where i=1,2,3, . . . , and each element corresponds to an observed data frame. The UBM Model M_(UBM) is parameterized by

$\{m_{i}^{UBM}, \Sigma_{i}^{UBM}, p_{i}^{UBM}\}_{i=1,\ldots,N^{UBM}}$ and $T^{UBM}$,

consisting of the estimates of the mean, diagonal covariance, and mixture weight parameters for each of the N^(UBM) Gaussian components in the transformed space specified by an MLLT (Maximum Likelihood Linear Transform) transformation, T^(UBM), which is chosen to give the optimal space for restriction to diagonal covariance models. That is represented by the equation:

$m_{i}^{UBM} = T^{UBM} m_{i}^{UBM,o}$ and $\Sigma_{i}^{UBM} = \mathrm{diag}\left(T^{UBM} \Sigma_{i}^{UBM,o} T^{UBM,T}\right)$

The “o” in the superscript indicates parameters derived from the original untransformed data through Expectation Maximization (EM) iterations. The “T” stands for transpose. EM is a standard technique to update the parameters of a model. After EM, T^(UBM) is estimated based on the resultant parameters and is subsequently applied to them to construct the final model. This UBM model represents the background population and is trained with data from a large number of speakers so as to create a model without idiosyncratic characteristics. Based on this reference, each speaker model M_(j) is parameterized by:

$\{m_{i}^{j}, \Sigma_{i}^{j}, p_{i}^{j}\}_{i=1,\ldots,N^{UBM}}$

The speaker dependent MLLT, T^(j), is identical to T^(UBM), whereas more generally it could be different. These parameters are derived via Maximum A Posteriori Probability (MAP) adaptation, a standard technique for parameter update, from the UBM parameters in the transformed space, based on speaker specific training data. The number of Gaussian components is the same as that for the UBM. Thus, the observed speaker training data {x_(i)} is transformed into the new space {T^(UBM) x_(i)} before the MAP adaptation.

Discriminants

To evaluate a speaker model with respect to test data we use a likelihood ratio based discriminant function that takes into account the added feature transformation. Given a set of vectors X={x_(t)}, t=1 . . . N_(test), in R^(n), the frame based discriminant function for any individual target model M^(j) is (equation no. 1):

$\begin{matrix}{{d\left( {x_{t}\text{}M^{j}} \right)} = {{\log \; p\left( {{T^{UBM}x_{t}\text{}m_{u^{*}}^{j}},{\sum\limits_{i^{*}}^{j}{,p_{i^{*}}^{j}}}} \right)} -}} \\{{\max_{i}\left\lbrack {\log \; {p\left( {{T^{UBM}x_{t}\text{}m_{i}^{UBM}},{\sum\limits_{i}^{UBM}{,p_{i}^{UBM}}}} \right)}} \right\rbrack}}\end{matrix}$

where the index i runs through the mixture components in the model M^(UBM), i* is the maximizing index, and p(·) is a multi-variate Gaussian density. Extending to the entire test data gives (equation no. 2):

${d\left( {X\text{}M^{j}} \right)} = {\frac{1}{N_{test}}{\sum\limits_{t = 1}^{N_{test}}{d\left( {x_{t}\text{}M^{j}} \right)}}}$

When used for verification, the result is compared to a threshold. For identification, the function is computed for all speakers “j” to find the maximizing speaker score. We motivate the use of a filter as a computationally significant mechanism to control resource usage by noting that the above computation is required for each frame analyzed. There is an additional final score sorting cost for identification, but for practical purposes, the number of frames will vastly outnumber the speakers, maintaining the significance of frame reduction.
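A minimal pure-Python sketch of the frame based discriminant (equation no. 1) and its average over the test data (equation no. 2) follows. It is written under stated assumptions: each model is a list of (mean, diagonal variances, mixture weight) components, the mixture weight is folded into each component's log density, and the toy models and the identity transform are invented for illustration and are not taken from the experiments.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One Gaussian component: (mean, diagonal variances, mixture weight).
Component = Tuple[Vector, Vector, float]


def matvec(T: Sequence[Vector], x: Vector) -> List[float]:
    """Apply the feature transformation T to the frame x."""
    return [sum(t * v for t, v in zip(row, x)) for row in T]


def log_component(y: Vector, comp: Component) -> float:
    """Log of (weight * diagonal-covariance Gaussian density) at y.
    Including the weight here is an assumption of this sketch."""
    mean, var, weight = comp
    ll = math.log(weight)
    for yk, mk, vk in zip(y, mean, var):
        ll += -0.5 * (math.log(2.0 * math.pi * vk) + (yk - mk) ** 2 / vk)
    return ll


def frame_discriminant(x, T, speaker: List[Component], ubm: List[Component]) -> float:
    """Equation 1: speaker score at the maximizing UBM component i*, minus the UBM maximum."""
    y = matvec(T, x)
    ubm_scores = [log_component(y, c) for c in ubm]
    i_star = max(range(len(ubm)), key=ubm_scores.__getitem__)
    return log_component(y, speaker[i_star]) - ubm_scores[i_star]


def utterance_discriminant(X, T, speaker, ubm) -> float:
    """Equation 2: average of the frame discriminants over all test frames."""
    return sum(frame_discriminant(x, T, speaker, ubm) for x in X) / len(X)


# Toy two-component models in two dimensions (hypothetical values).
T = [[1.0, 0.0], [0.0, 1.0]]
ubm = [([0.0, 0.0], [1.0, 1.0], 0.5), ([4.0, 4.0], [1.0, 1.0], 0.5)]
speaker = [([0.5, 0.5], [1.0, 1.0], 0.5), ([4.0, 4.0], [1.0, 1.0], 0.5)]
X = [[0.5, 0.5], [0.6, 0.4]]

score = utterance_discriminant(X, T, speaker, ubm)
print(round(score, 6))
```

Because every frame requires a pass over all UBM components, the cost of scoring grows with the number of frames, which is the quantity the filter reduces.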

Filtering Data

The sequence of test data frames is denoted by {x_(t)}, t=1 . . . N_(test). Each element of the sequence has a label, such that labels and frames are in one to one correspondence. The labeling is assumed to produce, for each frame, an element l from an alphabet of labels L. Thus, X′={(x_(t),l_(t))}, t=1 . . . N_(test).

The speaker models are represented by the set M={M_(j)}. Let {x^(j)_(dev)} be development data for model M_(j) and {x_(dev)} be their union over j. Define F_(SI)={l_(k)}, k=1 . . . N^(L)_(SI) (SI is speaker independent) to be the set of labels defining the filter independent of the speaker being detected. N^(L)_(SI) is the total number of labels in the filter. Let {x_(dev,l)} be the subset of the development data labeled l. Then:

$l_{1} = \arg\max_{l_{i} \in L} \mathrm{perf}\left(\{x_{dev,l_{i}}\}\right)$,

(perf is the performance measure of interest). For the speaker independent filter, this measure is the aggregate identification rate computed over all target models using the development data. The particular experiment used for optimization is a closed set identification task among all target speakers. Continuing,

$l_{n} = \arg\max_{l_{i} \in L - \{l_{1} \ldots l_{n-1}\}} \mathrm{perf}\left(\{x_{dev,l_{i}}\}\right)$.

Similarly, F_(j)={l_(k)}, k=1 . . . N^(L)_(j), which is the set of labels defining the filter for speaker j, is defined as above with {x^(j)_(dev,l_(i))} replacing {x_(dev,l_(i))} (i.e. use data only from speaker j and label l_(i)) and the individual identification rate of speaker j replacing the aggregate rate for the performance measure.
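The successive arg max selection of l_(1), l_(2), . . . can be sketched in Python as follows. The function name `rank_labels`, the stand-in performance measure `fraction_correct`, and the toy development data are hypothetical; in the described system perf would be an identification rate obtained by actually running recognition on the development data carrying each label.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def rank_labels(
    frames: Iterable[Tuple[object, str]],
    perf: Callable[[Sequence[object]], float],
) -> List[str]:
    """Rank labels by repeatedly taking the arg max of perf over the labels not yet chosen."""
    by_label: Dict[str, list] = {}
    for x, label in frames:
        by_label.setdefault(label, []).append(x)

    remaining = set(by_label)
    ranking: List[str] = []
    while remaining:
        # sorted() only makes tie-breaking deterministic; it is not part of the method.
        best = max(sorted(remaining), key=lambda lab: perf(by_label[lab]))
        ranking.append(best)
        remaining.remove(best)
    return ranking


def fraction_correct(outcomes: Sequence[int]) -> float:
    """Stand-in performance measure: share of items marked correct (1) versus wrong (0)."""
    return sum(outcomes) / len(outcomes)


# Toy development data: each item is (1 if correctly identified else 0, label).
dev = [
    (1, "N"), (1, "N"), (0, "N"),
    (1, "IH"), (0, "IH"),
    (0, "T"), (0, "T"), (1, "T"), (0, "T"),
]

ranking = rank_labels(dev, fraction_correct)
print(ranking)  # ['N', 'IH', 'T']
```

Passing development data from all target speakers with an aggregate measure would correspond to the speaker independent filter, while passing only the data of speaker j with that speaker's individual rate would correspond to F_(j).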

The discriminant with speaker independent filtering becomes (equation3):

${d\left( {X\text{}M^{j}} \right)} = {\frac{1}{\sum\limits_{i = 1}^{N_{test}}{I\left( {x_{i} \in F_{SI}} \right)}}{\sum\limits_{t = 1}^{N_{test}}{{I\left( {x_{t} \in F_{SI}} \right)}{d\left( {x_{t}\text{}M^{j}} \right)}}}}$

and with speaker dependent filtering (equation 4):

${{d\left( {X\text{}M^{j}} \right)} = {\frac{1}{\sum\limits_{i = 1}^{N_{test}}{I\left( {x_{i} \in F_{j}} \right)}}{\sum\limits_{t = 1}^{N_{test}}{{I\left( {x_{t} \in F_{j}} \right)}{d\left( {x_{t}\text{}M^{j}} \right)}}}}}\;$

where I(.) is an indicator function of the validity of its argument, taken to mean that the vector's label is passed by the filter.
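The filtered discriminants (equations 3 and 4) amount to averaging the frame scores over only those frames whose label is passed by the filter. A minimal Python sketch is given below; the function name `filtered_discriminant`, the call counter, and the toy scores are hypothetical, and the per-frame scoring function is supplied by the caller so that the example stays independent of any particular model.

```python
from typing import Callable, Iterable, Set, Tuple


def filtered_discriminant(
    frames: Iterable[Tuple[object, str]],
    passed: Set[str],
    score: Callable[[object], float],
) -> float:
    """Average the frame discriminant over frames whose label is in the filter set.

    Frames with other labels are never scored, which is where the savings come from.
    """
    kept = [score(x) for x, label in frames if label in passed]
    if not kept:
        raise ValueError("the filter passed no frames")
    return sum(kept) / len(kept)


calls = 0


def toy_score(x: float) -> float:
    """Stand-in for d(x_t | M^j): returns the stored value and counts invocations."""
    global calls
    calls += 1
    return x


# Toy labeled frames whose "feature" is already the frame score (illustrative only).
frames = [(0.2, "N"), (-1.0, "AX"), (0.4, "IH"), (0.6, "N")]

result = filtered_discriminant(frames, {"N"}, toy_score)
print(round(result, 6), calls)
```

Using F_(SI) for the filter set gives the speaker independent form, and using F_(j) for the model being scored gives the speaker dependent form.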

Thus target models are associated with an optimal set of labels with respect to recognition performance. In cases where the identity of the speaker sought is known, the data in the system can be filtered to pass only the optimal labels, F_(j), for that speaker. Alternatively, if there is a set of speakers of interest, say the enrolled population, then an aggregate best list of labels, F_(SI), can be chosen. Silence removal, a common practice, would be a degenerate form of this type of filtering.

TABLE 1
Identification performance for various filtering configurations.

phone rank          1       2       3
Name                N       IH      T
% data              6.21    5.74    5.69
% accuracy          72.62   68.67   64.42
top 2 % accuracy            82.43
top 3 % accuracy                    85.21

EXAMPLES

Setup

The data consisted of the audio portion of a broadcast news database. A subset of sixty four speakers was selected as the target speakers. The waveforms were mono 16 kHz PCM (Pulse-code modulation). The analysis configuration was 19 dimensional MFCC+1st derivative (38 dim vector) with feature warping. A rate of 100 frames per second, with 50% overlap, was used and the MFCC (Mel Frequency Cepstral Coefficients) were computed over a 20 millisecond window. For each speaker, two minutes of data were set aside and used for training the final models. The remaining data was partitioned according to various criteria and used for testing. No labeled data was used in training. There were 683 testing cells, ranging from 306 to 2034 frames (3.06 to 20.34 seconds).

Label Ranking

The data was labeled with an HMM (Hidden Markov Model) based ASR (Automatic Speech Recognition) system that generated alignments to available transcripts. As such, the labels are relatively high in quality. A set of 41 phonetic units were used: S TS UW T N K Y Z AO AY SH W NG EY B CH OY AX JH D G UH F V ER AA IH M DH L AH P OW AW HH AE TH R IY EH ZH.

The first example details the per phone based recognition performance based on the above labels using the 38 dimensional MFCC based features. The final speaker models did not use any label information. Table 1 summarizes the identification performance for filtering configurations based on phonesets determined by aggregate performance, i.e. based on the top labels in F_(SI). Using all of the data, the overall identification performance was 92.53% correct. A further breakdown of the results shows that, for the top three phones individually, the results were: 72.62% for “N”, 68.67% for “IH”, and 64.42% for “T”, representing respectively 6.21, 5.74, and 5.69 percent of the total data. The top phones were determined based on ranking of aggregate performance on all speakers. Scoring data from the top 2 phones combined results in 82.43% accuracy on 11.96 percent of the data. The top 3 phones together give 85.21% on 17.65 percent of the data.

Detection

In the case of speaker detection, the results can be broken down with respect to speaker independent (aggregate best) phonesets and speaker dependent phonesets, which were determined as those for which the recognition rates were individually maximized. The equal error rate for the baseline case where all of the data is used is 4.39%.

Speaker Independent Filtering

The examples above indicate the amount of throughput increase that can be achieved by implementing the present invention. Consider the verification performance for 5 phonesets, ranging from the set with only the top performing (aggregate best) phone, to the set with the top 5 phones. Each of these sets represents a data filter, and the amount of data passing the filter naturally increases with the number of elements in the set. The performance improves as well. Filtering based on the top phone leaves 6.21% of the data with a corresponding equal error rate (EER) of 10.25%. Top 2 phones=11.96% data corresponding to 7.24% EER. Top 3=17.65% data → 6.84% EER. Top 4=21.80% data → 6.16% EER. Filtering based on the top 5 phones leaves 31.91% with an EER of 6.35%. We point out that, for the speaker independent case, the performance of the top 4 is better than that of the top 5. However, as will be seen in the next experiments, the top five set does indeed perform better than the top four set for the speaker dependent filters.

Speaker Dependent Filtering

In the case of a detection problem, of which verification is an example, the amount of data can be further reduced and performance improved by tailoring the filters to the entity or object being detected, in the present case a speaker. In this case, filtering based on the top phone, which depends on the speaker, passes only 5.7% of the data resulting in an EER of 8.24%, top 2=10.84% data → 6.78% EER, top 3=15.55% data → 6.0% EER, top 4=20.11% data → 5.93% EER, while the top 5 phones select 24.20% of the data for an EER of 5.45%. By having speaker dependent filters (in general, filters tailored to the entity being detected), less data is passed through the filters and better performance is achieved, as compared to the speaker independent case. For the present configuration, a throughput increase of 400% can be achieved with a 1% increase in EER, as compared to the all data case.
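The reported operating points can be tabulated with a short Python calculation. This is a sketch under one stated assumption, namely that scoring cost is proportional to the number of frames passed, so that the throughput factor is the reciprocal of the data fraction. On that reading, the speaker dependent top 5 filter (24.20% of the data, 5.45% EER) gives a factor of roughly four against the 4.39% EER baseline, which appears to be the operating point behind the figures quoted above; the variable and function names are illustrative.

```python
# Figures copied from the text: (percent of data passed, EER in percent) for top 1..5 phones.
BASELINE_EER = 4.39
SPEAKER_INDEPENDENT = [(6.21, 10.25), (11.96, 7.24), (17.65, 6.84), (21.80, 6.16), (31.91, 6.35)]
SPEAKER_DEPENDENT = [(5.7, 8.24), (10.84, 6.78), (15.55, 6.0), (20.11, 5.93), (24.20, 5.45)]


def operating_point(percent_data: float, eer: float):
    """Return (throughput factor, EER increase in percentage points over the baseline).

    Assumes cost scales linearly with the number of frames scored.
    """
    return 100.0 / percent_data, eer - BASELINE_EER


for n, (si, sd) in enumerate(zip(SPEAKER_INDEPENDENT, SPEAKER_DEPENDENT), start=1):
    si_speed, si_delta = operating_point(*si)
    sd_speed, sd_delta = operating_point(*sd)
    print(f"top {n}: SI x{si_speed:.2f} (+{si_delta:.2f} EER)  SD x{sd_speed:.2f} (+{sd_delta:.2f} EER)")

sd_top5_speedup, sd_top5_delta = operating_point(*SPEAKER_DEPENDENT[-1])
```

At every set size in these figures the speaker dependent filter passes less data and has a lower EER than the speaker independent one.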

Thus, a method is presented which reduces data flow and thereby increases processing capacity, while preserving a high level of accuracy in a distributed speech processing environment. The method of the present invention includes filtering out (dropping) data frames based on a target speaker specific subset of labels using data filters, which, as the results show, preserves accuracy and passes only a fraction of the data by optimizing target (specific) performance measures. Therefore, a high level of speaker recognition accuracy is maintained while utilizing existing processing capabilities, eliminating the need to increase processing capabilities as would be required to analyze all of the data without the method of the present invention.

While the present invention has been particularly shown and described with respect to preferred embodiments thereof, it will be understood by those skilled in the art that changes in forms and details may be made without departing from the spirit and scope of the present application. It is therefore intended that the present invention not be limited to the exact forms and details described and illustrated herein, but fall within the scope of the appended claims.

1. A method for reducing the amount of speech data for detecting desired speakers comprising: receiving a plurality of speech data from a plurality of speakers; labeling the speech data; providing a speaker dependent filter which employs a label priority ranking of the labeled speech data for each of the speakers, and the ranking is generated by prioritizing the labeled speech data according to a performance criterion for each of the plurality of speakers; analyzing the labeled speech data according to the priority rank of the label of the speech data for a specific speaker, wherein specified lower priority labeled speech data is dropped or filtered out such that less speech data is analyzed.
2. The method of claim 1 wherein the speakers are divided into groups of speakers and the filters are generated and used for each group of speakers.
3. The method of claim 2 wherein the labels for the groups of speakers are priority ranked.
4. The method of claim 1 wherein the data is pre-labeled.
5. A speaker recognition system using a computer, which comprises: a plurality of speech data from a plurality of speakers stored on a computer; a labeler element for analyzing the speech data and labeling the data; a speech data filter including a filter element for arranging the labels in a priority rank and the speech data filter designed according to a performance criterion for each of the plurality of speakers; and a recognition element for analyzing the labeled data in accordance with the priority rank as determined by the speech data filter.
6. The system of claim 5 wherein the labels are phonetic.